What is the "Bitter Lesson"?
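The Bitter Lesson is a thesis introduced by Rich Sutton, a computer scientist who is considered one of the founders of the field of reinforcement learning. He is known for popularizing the “scaling hypothesis” as well as the “bitter lesson”. The thesis holds that, in the long run, general methods that leverage computation outperform methods built on human domain knowledge.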
Historically, AI research has mostly designed systems to use a fixed amount of computing power, improving their performance by applying domain-specific human knowledge. In theory, this approach is compatible with also improving performance by scaling (increasing computing power), but in practice, the complications introduced by leveraging human ingenuity make it harder to also leverage computation. Available computing power keeps growing steadily in accordance with Moore’s law, and past trends suggest that leveraging this growth is what increases performance in the long run.
Some fields that Sutton cites as examples of the Bitter Lesson are:
- Games: In chess, Deep Blue beat world champion Garry Kasparov in 1997 by using enormous computing power (for the time) to search deeply through the tree of possible moves. Similarly, in Go, AlphaGo beat world champion Lee Sedol in 2016 using deep learning plus Monte Carlo tree search to find its moves, instead of using human-engineered Go techniques. The following year, AlphaGo Zero beat AlphaGo after training purely through self-play, without using human-generated Go data at all. None of these advances relied on humans coding in a deeper strategic understanding of the game.
- Vision: Early computer vision methods worked with human-engineered features and convolution kernels to perform image recognition tasks, but it was later found that leveraging more compute and letting convolutional neural nets (CNNs) learn their own features yields much better performance.
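As a toy illustration of the search approach from the Games example, here is a minimal sketch of depth-limited game-tree search. It is not how Deep Blue or AlphaGo were implemented. The game (a single pile of stones, take 1 to 3 per turn, whoever takes the last stone wins) and the names `MOVES`, `negamax` and `best_move` are assumptions chosen for brevity. The point is that the code contains no strategy at all: the only knob is `depth`, which stands in for the available compute.

```python
MOVES = (1, 2, 3)  # toy rule (assumed): take 1-3 stones; taking the last stone wins


def negamax(stones: int, depth: int) -> int:
    """Value for the player to move: +1 win, -1 loss, 0 unknown at the search horizon."""
    if stones == 0:
        return -1  # the opponent just took the last stone, so the mover has lost
    if depth == 0:
        return 0  # no built-in knowledge: positions beyond the horizon score as unknown
    return max(-negamax(stones - m, depth - 1) for m in MOVES if m <= stones)


def best_move(stones: int, depth: int) -> int:
    """Pick the move whose resulting position is worst for the opponent (depth >= 1)."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -negamax(stones - m, depth - 1))


shallow = best_move(10, 1)   # horizon too close: every move looks the same
deep = best_move(10, 10)     # enough depth to search the game to the end
```

With a shallow horizon, every move from 10 stones scores as unknown and the search just returns the first legal move. With enough depth, the same code finds the winning move (leaving the opponent a multiple of 4) even though nobody told it that rule.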
Modern AI has learned to favor general-purpose methods of search and learning, which continue to scale with increasing compute. Over the last couple of generations of transformer models, simply scaling models up has been so effective that it led OpenAI to propose scaling laws, which describe the relationship between a model’s performance and the amount of compute used to train it.
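To make the idea of a scaling law concrete, the sketch below assumes the power-law shape commonly reported for such laws, loss = a * compute^(-alpha), and fits it by least squares in log-log space. The data points and the constants `TRUE_A` and `TRUE_ALPHA` are made up for illustration and are not measurements from any real model. `fit_power_law` is a hypothetical helper, not part of any library.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-alpha), done on log-transformed data."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


# Synthetic data generated from an assumed power law (illustrative constants only).
TRUE_A, TRUE_ALPHA = 10.0, 0.05
compute = [10.0 ** k for k in range(3, 9)]
loss = [TRUE_A * c ** (-TRUE_ALPHA) for c in compute]

a, alpha = fit_power_law(compute, loss)
predicted = a * (1e10) ** (-alpha)  # extrapolate to a larger compute budget
```

On a log-log plot a power law is a straight line, which is why a simple linear fit recovers the exponent. The practical appeal of such a fit is the last line: it gives a rough forecast of how much performance a larger compute budget might buy, under the assumption that the trend continues.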